{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis and evaluation of PII regexes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we do some exploratory analysis of our PII detection tools using the annotated dataset, we observe that:\n",
    "1. For SSH & API keys detection:\n",
    "* detect-secrets tool has a very low recall (detects 2 out of 37 keys). Update: this tool apparently needs context even though it uses regexes. TODO: run another evaluation of the whole code files instead of just instances.\n",
    "* our regex for keys: https://regex101.com/r/pndDnd/1 : \n",
    "    * has a high recall (detects 28 out of 37 keys, and probably more because some keys had dots inside so the regex split them into multiple keys)\n",
    "    * has many false positives that are paths, words attached by \":\" or \"_\" you can find them in `./experiments/before_gibberish_keys.txt`\n",
    "    * one solution to increase precision was to use a **Gibberish detector**, if a detected key is not labeled as gibberish (not word like) we don't keep it,\n",
    "    this removes 174 false positives that you can find in `./experiments/before_gibberish_keys.txt`\n",
    "    * **TO IMPROVE:**\n",
    "    * Some hashes are not labelled as gibberish by the gibberish detector(=> not filtered), not sure if they are really secrets, for an example see `./experiments/file_with_hashes.txt` (some other hashes -from that file- are filtered though)\n",
    "    * There are still some false positives like name/path (labeled as gibberish) in this format \"e2e_mask_rcnn_X-152-32x8d-FPN-IN5k_1.44x\" and \"//deno.land/std@0.142.0/testing/asserts.ts\"\n",
    "    * If there is an \"=\" or \"id=\" in front of the key it is detected\n",
    "    * Some instances like \"f47dbc9c:\" and \"dc22a3aa:\" are detected, tehy seems like ids of patch releases, their context is saved in `./experiments/short_keys_patch_releases.txt`\n",
    "    * You can check all detected keys by looking for 'KEY' tags in `./experiments/list_detected_pii.txt` \n",
    "* TODO: get precision numbers and try adding more filters (from detect-secrets fore example)\n",
    "2. For email detection:\n",
    "* **TO IMPROVE:**\n",
    "* our regex https://regex101.com/r/8CsR5P/1 and the updated bigscience regex https://regex101.com/r/LNwpG1/1 labelled a lot of samples like \"dusk-network/icon@4.5.0\" as emails\n",
    "* the updated bigscience regex doesn't detect well emails with are between \"<\" and \">\" as in `<email>`.\n",
    "* our regex detected noreply@127.0.0.1 as an email\n",
    "* both regexes have a high recall on the list of emails we detected (without delimiters)\n",
    "* TODO: more comparison of the two regexes and precision/recall numbers and use of context detection\n",
    "3. IP addresses: TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c44b7df2d9724f32a249f38d8cc69d56",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading readme:   0%|          | 0.00/896 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using custom data configuration loubnabnl--pii-instances-b56ea5fc2b13487c\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading and preparing dataset None/None to /Users/loubnabenallal/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--pii-instances-b56ea5fc2b13487c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "628fb39584cf40ae8746479f798e69b0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "af284acc0f4f4b0a802ec577c7f9919b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading data:   0%|          | 0.00/17.1k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2788dacaa51a41bba0ef397b2aaf80b1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5e548f9ecb1c43ccaa99d444e5216c2a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0 tables [00:00, ? tables/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset parquet downloaded and prepared to /Users/loubnabenallal/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--pii-instances-b56ea5fc2b13487c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# this dataset has lists of PII without context\n",
    "ds = load_dataset(\"bigcode/pii-instances\", use_auth_token=True, split=\"train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.emails_ip_addresses_detection import detect_email_addresses\n",
    "from utils.keys_detection import detect_keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# small test\n",
    "text = \"\"\"this is a test example with an email random@hf.co and address 10.1.1.1\n",
    "          aws_access_key_id=AKIAIOSFODNN7EXAMPLE\n",
    "          aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\n",
    "          randomstring=b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABDHn\"\"\"\n",
    "keys = detect_keys(text)\n",
    "emails_ip_adresses = detect_email_addresses(text, new_email_regex=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'tag': 'AWS Access Key', 'value': 'AKIAIOSFODNN7EXAMPLE', 'start': 99, 'end': 119}]\n"
     ]
    }
   ],
   "source": [
    "print(keys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'tag': 'EMAIL', 'value': 'random@hf.co', 'start': 37, 'end': 49}, {'tag': 'IP_ADDRESS', 'value': '10.1.1.1', 'start': 62, 'end': 70}]\n"
     ]
    }
   ],
   "source": [
    "print(emails_ip_adresses)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'tag': 'KEY',\n",
       "  'value': '=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',\n",
       "  'start': 151,\n",
       "  'end': 192},\n",
       " {'tag': 'KEY',\n",
       "  'value': '=b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABDHn',\n",
       "  'start': 215,\n",
       "  'end': 281}]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "other_keys = detect_email_addresses(text, tag_types={\"KEY\"}, new_email_regex=False)\n",
    "other_keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tests\n",
    "Let's test the pipelines on our PII instances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Due to an issue with LighTag, we can't download annotations with the reviews, we will need to manually clean the samples and remove empty samples we added to the dataset to have the same number of rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# select sample correct api samples\n",
    "ds_api = ds.select([i for i in range(17) if i not in [7, 8, 13, 14, 15]] + [i for i in range(43, 80) if i not in [44, 63, 69, 72]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "45"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(ds_api[\"API_KEY\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "api_keys_clean = set(ds_api[\"API_KEY\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "30"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(api_keys_clean)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['6622386f8d83dc9efefb8c03a4dbfc18e7928d89ffc2ec3e2feb9473e8f410c9',\n",
       " '546d57b6c88c2be7517759c016c0bf0313dfcc14adfcb43967f3c5d24657f366',\n",
       " '76d8ae334545bbdf2db49414c25d2cfd8685e7b6187f119b28e93ad9c5118e9d',\n",
       " '43e0352fee07fa5b92dd22e557cb1d050ccde0cf97273e02f694930695b15134',\n",
       " 'c9eb8a1102d0a68cafc93f22df73445b8f69706f3322285f9a2f623a28df0176',\n",
       " 'eff634a68a01d081c0bdc51752dfa0709781f0e4',\n",
       " '4d986a461d1b24bb5776fb49063b9a1891939f336b306a6bc75f58d0a4e98bcb']"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ssh_keys_clean = ds[\"SSH_KEY\"][:7]\n",
    "ssh_keys_clean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses_clean = ds[\"IP_ADDRESS\"][:95]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Detect API keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "detect_secrets_results = []\n",
    "regexes_results = []\n",
    "detect_secrets_nb = 0\n",
    "regexes_nb = 0\n",
    "for key in api_keys_clean:\n",
    "    output_1 = detect_keys(key)\n",
    "    output_2 = detect_email_addresses(key, tag_types={\"KEY\"}, new_email_regex=False)\n",
    "    if output_1:\n",
    "        detect_secrets_nb += 1\n",
    "        detect_secrets_results.append(output_1)\n",
    "    if output_2:\n",
    "        regexes_nb += 1\n",
    "        regexes_results.append(output_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nb dectected by detect-secrets: 2\n",
      "nb dectected by regexes: 21\n",
      "number true API keys: 30\n"
     ]
    }
   ],
   "source": [
    "print(f\"nb dectected by detect-secrets: {detect_secrets_nb}\")\n",
    "print(f\"nb dectected by regexes: {regexes_nb}\")\n",
    "print(f\"number true API keys: {len(api_keys_clean)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[{'tag': 'JSON Web Token',\n",
       "   'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDE0NDE3fQ.',\n",
       "   'start': 0,\n",
       "   'end': 192}],\n",
       " [{'tag': 'JSON Web Token',\n",
       "   'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDEzNTUxfQ.',\n",
       "   'start': 0,\n",
       "   'end': 192}]]"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "detect_secrets_results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "detect-secrets has a very low recall: 2 out of 30, let's anlyze the regex detections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "wrong detection at 6 for ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-RC4-SHA:ECDHE-RSA-AES128-SHA:AES128-GCM-SHA256:RC4:HIGH:\n",
      "\n",
      "\n",
      "wrong detection at 7 for 476611152863-ltgqfk9jhq1vsenin5039n58ogkraltb\n",
      "\n",
      "\n",
      "detection was split at 11 for [{'tag': 'KEY', 'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9', 'start': 0, 'end': 36}, {'tag': 'KEY', 'value': 'eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDE0NDE3fQ', 'start': 37, 'end': 191}, {'tag': 'KEY', 'value': 'D8ja66bnLxJ3bsJlaKRtOquu8XbibjNCyFxJpI7vafc', 'start': 192, 'end': 235}]\n",
      "\n",
      "\n",
      "detection was split at 12 for [{'tag': 'KEY', 'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9', 'start': 0, 'end': 36}, {'tag': 'KEY', 'value': 'eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDEzNTUxfQ', 'start': 37, 'end': 191}, {'tag': 'KEY', 'value': '8X-fBRUHdrwtkTLcOFAsW-vvvqCzmkZKM2gQgHNkBKk', 'start': 192, 'end': 235}]\n",
      "\n",
      "Number of correctly detected strings: 17\n"
     ]
    }
   ],
   "source": [
    "res = 0\n",
    "values = []\n",
    "for i, elem in enumerate(regexes_results):\n",
    "    if len(elem) != 1:\n",
    "        print(f\"\\ndetection was split at {i} for {elem}\\n\")\n",
    "    else:\n",
    "        value = elem[0][\"value\"]\n",
    "        if value in api_keys_clean:\n",
    "            res += 1\n",
    "            values.append(value)\n",
    "        else:\n",
    "            print(f\"\\nwrong detection at {i} for {value}\\n\")\n",
    "print(f\"Number of correctly detected strings: {res}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "missed keys\n",
      "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN0123456789\n",
      "ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-RC4-SHA:ECDHE-RSA-AES128-SHA:AES128-GCM-SHA256:RC4:HIGH:!MD5:!aNULL:!EDH:!CAMELLIA\n",
      "476611152863-ltgqfk9jhq1vsenin5039n58ogkraltb.apps.googleusercontent.com\n",
      "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN\n",
      "mfxl'vmsdv';mfdb'fdamlmdsvfdkfnjn\n",
      "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDE0NDE3fQ.D8ja66bnLxJ3bsJlaKRtOquu8XbibjNCyFxJpI7vafc\n",
      "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDEzNTUxfQ.8X-fBRUHdrwtkTLcOFAsW-vvvqCzmkZKM2gQgHNkBKk\n",
      "2(iwreobf4b(-=h_p=^!obgxdgn3_*s!17=_3wc4dun9_y^q+c\n",
      "rSHvhgdOQUB4KMc5JS1alzhg\n",
      "6595b64144ccf1df\n",
      "AIzaasdf\n",
      "0123456789abcdefghijklmno\n",
      "ABCDEFGHIJKLMNabcdefghijklmnopqrstuvwxyz0123456789\n"
     ]
    }
   ],
   "source": [
    "print(\"missed keys\")\n",
    "# three of them were just truncated because they contained dots inside: not sure they are real keys\n",
    "for key in api_keys_clean:\n",
    "    if key not in values:\n",
    "        print(key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Detect SSH keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "detect_secrets_results = []\n",
    "regexes_results = []\n",
    "detect_secrets_nb = 0\n",
    "regexes_nb = 0\n",
    "for key in ssh_keys_clean:\n",
    "    output_1 = detect_keys(key)\n",
    "    output_2 = detect_email_addresses(key, tag_types={\"KEY\"}, new_email_regex=False)\n",
    "    if output_1:\n",
    "        detect_secrets_nb += 1\n",
    "        detect_secrets_results.append(output_1)\n",
    "    if output_2:\n",
    "        regexes_nb += 1\n",
    "        regexes_results.append(output_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nb dectected by detect-secrets: 0\n",
      "nb dectected by regexes: 7\n",
      "number true ssh keys: 7\n"
     ]
    }
   ],
   "source": [
    "print(f\"nb dectected by detect-secrets: {detect_secrets_nb}\")\n",
    "print(f\"nb dectected by regexes: {regexes_nb}\")\n",
    "print(f\"number true ssh keys: {len(ssh_keys_clean)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of correctly detected strings: 7\n"
     ]
    }
   ],
   "source": [
    "res = 0\n",
    "values = []\n",
    "for i, elem in enumerate(regexes_results):\n",
    "    if len(elem) != 1:\n",
    "        print(f\"\\ndetection was split at {i} for {elem}\\n\")\n",
    "    else:\n",
    "        value = elem[0][\"value\"]\n",
    "        if value in ssh_keys_clean:\n",
    "            res += 1\n",
    "            values.append(value)\n",
    "        else:\n",
    "            print(f\"\\nwrong detection at {i} for {value}\\n\")\n",
    "print(f\"number of correctly detected strings: {res}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remarks & questions: \n",
    "* some of the keys missed included dots, can API keys include dots? \n",
    "* add this regex to detect-secrets plugins to use filters on top of it ?\n",
    "Observations:\n",
    "* detect-secrets is not able to detect most API keys and all SSH keys\n",
    "* our regex for keys detects all shh keys ad 17 out of 30, 2 keys are split into 3 parts because they had two dots inside, and most of the keys left may not be real API keys\n",
    "\n",
    "=>\n",
    "* detect-secrets has a very low recall (even with no filters), the other secret keywords have many false positives so we can't add them.\n",
    "* our regex seems to have a high recall(very few missed positives/keys)\n",
    "* let's measure its precision by running it on the original code files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using custom data configuration bigcode--pii-for-code-2810c83b744e2a86\n",
      "Found cached dataset json (/Users/loubnabenallal/.cache/huggingface/datasets/bigcode___json/bigcode--pii-for-code-2810c83b744e2a86/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "ds_full = load_dataset(\"bigcode/pii-for-code\", use_auth_token=True, split=\"train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 6 for detectedt key: 'f47dbc9c:' in context: s@5.0.2\n",
      "  - @dusk-network/button@5.0.2\n",
      "  - @dusk-network/menu@5.0.2\n",
      "\n",
      "## 5.0.1\n",
      "\n",
      "### Patch Changes\n",
      "\n",
      "- f47dbc9c: Release\n",
      "- Updated dependencies [f47dbc9c]\n",
      "  - @dusk-network/icon@5.0.1\n",
      "  - @dusk-network/helpers@5.\n",
      "\n",
      "id: 6 for detectedt key: 'dc22a3aa:' in context: s@3.0.7\n",
      "  - @dusk-network/button@3.0.7\n",
      "  - @dusk-network/menu@3.0.7\n",
      "\n",
      "## 3.0.6\n",
      "\n",
      "### Patch Changes\n",
      "\n",
      "- dc22a3aa: testing changesets\n",
      "- Updated dependencies [dc22a3aa]\n",
      "  - @dusk-network/icon@3.0.6\n",
      "  - @dusk-network\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from detect_pii import scan_pii_batch_viz\n",
    "\n",
    "examples = ds_full.select(range(100))\n",
    "outputs = scan_pii_batch_viz(examples, key_detector=\"regex\", new_email_regex=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Gibberish detector\n",
    "Adding the gibberish detector removes 173 false positives like:\n",
    "* ar/www/rajkdjango2/bin/python, param_and_buffer_names_set:, Msf::Exploit::Remote::SNMPClient, d2/d24/interfaceZeebe_1_1Client_1_1Api_1_1Builder_1_1IAccessTokenSupplier\n",
    "\n",
    "But also removes 8 hashes like these from files 31, 37:\n",
    "* d3d43ab4e03fdf106b9191f4e0161cfcde3f040e, d3d43ab4e03fdf106b9191f4e0161cfcde3f040e 8d11fab63089a24c8b17063d29a4b0eac359fb41\n",
    "\n",
    "Strings like this e2e_faster_rcnn_R-101-FPN_1x are considered gibberrish and thus detected as keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(\"37697547/e2e_keypoint_rcnn_R-50-FPN_1x\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hello loubna', ' is sha=']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = \"hello loubna\\n is sha=anna but\\n here is it\"\n",
    "first_index = text.index(\"anna\")\n",
    "lines = text[:first_index].splitlines()\n",
    "lines[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from gibberish_detector import detector\n",
    "Detector = detector.create_from_model('gibberish_data/big.model')\n",
    "Detector.is_gibberish(\"d5/d02/interfaceZeebe_1_1Client_1_1Api_1_1Commands_1_1IPublishMessageCommandStep3\".replace(\"_\", \" \").replace(\"-\", \" \").lower())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = \"cb2f8b691ccf3eae9846c67735f413a49befea28\"\n",
    "Detector.is_gibberish(text.lower())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "Detector.is_gibberish(\"27dcfe42e3fb3422b72ce48b48bf601c0a3e46e850ee72d9bdd17b5863b6e42c\".replace(\"_\", \" \").replace(\":\", \" \").lower())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Email detection\n",
    "\n",
    "* our current regex detects many false positives taht are derivatives of: dusk-network/helpers@4.6.12\n",
    "* bigscience updated regex: can't detect emails well when they are in this format: <email> and also labels dusk-network/helpers@4.6.12 as emails, see https://regex101.com/r/LNwpG1/1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "context e2e_faster_rcnn_X-101-32x8d-FPN_1x: : \"01_33_49.iAX0mXvW\",\n",
      "        \"35857345/e2e_faster_rcnn_R-50-FPN_1x\": \"01_36_30.cUF7QR7I\",\n",
      "        \"35857890/e2e_faster_rcnn_R-101-FPN_1x\": \"01_38_50.sNxI7sX7\",\n",
      "        \"36761737/e2e_faster_rcnn_X-101-32x8d-FPN_1x\": \"06_31_39.5MIHi1fZ\",\n",
      "        \"35858791/e2e_mask_rcnn_R-50-C4_1x\": \"01_45_57.ZgkA7hPB\",\n",
      "        \"35858933/e2e_mask_rcnn_R-50-FPN_1x\": \"01_48_14.DzEQe4wC\",\n",
      "        \"35861795/e2e_m\n",
      "True\n",
      "context e2e_keypoint_rcnn_R-50-FPN_1x: 843/e2e_mask_rcnn_X-101-32x8d-FPN_1x\": \"06_35_59.RZotkLKI\",\n",
      "        \"37129812/e2e_mask_rcnn_X-152-32x8d-FPN-IN5k_1.44x\": \"09_35_36.8pzTQKYK\",\n",
      "        # keypoints\n",
      "        \"37697547/e2e_keypoint_rcnn_R-50-FPN_1x\": \"08_42_54.kdzV35ao\"\n",
      "    }\n",
      "\n",
      "    @staticmethod\n",
      "    def get(name):\n",
      "        if name.startswith(\"Caffe2Detectron/COCO\"):\n",
      "            return ModelCatalog.get_c2_detectron_12_2017_base\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "from detect_pii import scan_pii_batch_viz\n",
    "\n",
    "examples = ds_full.select(range(100))\n",
    "# to use  updated BigScience regex set new_email_regex=True\n",
    "outputs = scan_pii_batch_viz(examples, key_detector=\"regex\", new_email_regex=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's test the recall of emails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using custom data configuration loubnabnl--pii-instances-b56ea5fc2b13487c\n",
      "Found cached dataset parquet (/Users/loubnabenallal/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--pii-instances-b56ea5fc2b13487c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# this dataset has lists of PII without context\n",
    "ds = load_dataset(\"loubnabnl/pii-instances\", use_auth_token=True, split=\"train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading cached processed dataset at /Users/loubnabenallal/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--pii-instances-b56ea5fc2b13487c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-05bc1724d8ac05a9.arrow\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "189\n"
     ]
    }
   ],
   "source": [
    "# filter samples from ds with an empty email column\n",
    "ds_emails = ds.filter(lambda x: x[\"EMAIL\"] != \"\")[\"EMAIL\"]\n",
    "print(len(ds_emails))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sample 44 is a wrong annotation\n",
    "ds_emails = ds_emails[:44] + ds_emails[45:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fiw an issue with this annotation\n",
    "ds_emails[142] = ds_emails[142][:-1]\n",
    "ds_emails[108] = ds_emails[108].strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "170"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds_emails = list(set(ds_emails))\n",
    "len(ds_emails)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.emails_ip_addresses_detection import detect_email_addresses\n",
    "\n",
    "old_regex_results = []\n",
    "new_regex_results = []\n",
    "new_regex_nb = 0\n",
    "old_regex_nb = 0\n",
    "for key in ds_emails:\n",
    "    output_1 = detect_email_addresses(key, tag_types={\"EMAIL\"}, new_email_regex=False)\n",
    "    output_2 = detect_email_addresses(key, tag_types={\"EMAIL\"}, new_email_regex=True)\n",
    "    if output_1:\n",
    "        old_regex_nb += 1\n",
    "        old_regex_results.append(output_1)\n",
    "    if output_2:\n",
    "        new_regex_nb += 1\n",
    "        new_regex_results.append(output_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nb emails dectected by old regex: 170\n",
      "nb emails dectected by new BS regex: 169\n",
      "number true EMAILS: 170\n"
     ]
    }
   ],
   "source": [
    "print(f\"nb emails dectected by old regex: {old_regex_nb}\")\n",
    "print(f\"nb emails dectected by new BS regex: {new_regex_nb}\")\n",
    "print(f\"number true EMAILS: {len(ds_emails)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of correctly detected strings with old regex: 170\n",
      "number of correctly detected strings with new regex: 169\n"
     ]
    }
   ],
   "source": [
    "def get_nb_detections(results, refs, mode=\"old\"):\n",
    "    res = 0\n",
    "    values = []\n",
    "    for i, elem in enumerate(results):\n",
    "        assert len(elem) == 1\n",
    "        value = elem[0][\"value\"]\n",
    "        if value in refs:\n",
    "            res += 1\n",
    "            values.append(value)\n",
    "        else:\n",
    "            print(f\"\\nwrong detection of {mode} regex at {i} for {value}\\n\")\n",
    "    return res, values\n",
    "\n",
    "res, values = get_nb_detections(old_regex_results, ds_emails)\n",
    "res_new, values_new = get_nb_detections(new_regex_results, ds_emails, mode=\"new\")\n",
    "print(f\"number of correctly detected strings with old regex: {res}\")\n",
    "print(f\"number of correctly detected strings with new regex: {res_new}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "missed emails with the new regex\n",
      "noreply@127.0.0.1\n"
     ]
    }
   ],
   "source": [
    "print(\"missed emails with the new regex\")\n",
    "# three of them were just truncated because they contained dots inside: not sure they are real keys\n",
    "for key in ds_emails:\n",
    "    if key not in values_new:\n",
    "        print(key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's a false annotation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<loffjh@gmail.com>',\n",
       " '<Chris.Mears@monash.edu>',\n",
       " '<pychuang@gwu.edu>',\n",
       " '<nguyenthieu2102@gmail.com>',\n",
       " '<robert.kausch@gmx.net>',\n",
       " 'info@srampos.com',\n",
       " 'mark.samman@gmail.com']"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for i in range(5):\n",
    "    ds_emails[i] = \"<\" + ds_emails[i] + \">\"\n",
    "ds_emails[:7]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nb emails dectected by old regex: 7\n",
      "nb emails dectected by new BS regex: 2\n",
      "number true EMAILS: 7\n"
     ]
    }
   ],
   "source": [
    "from utils.emails_ip_addresses_detection import detect_email_addresses\n",
    "\n",
    "old_regex_results = []\n",
    "new_regex_results = []\n",
    "new_regex_nb = 0\n",
    "old_regex_nb = 0\n",
    "for key in ds_emails[:7]:\n",
    "    output_1 = detect_email_addresses(key, tag_types={\"EMAIL\"}, new_email_regex=False)\n",
    "    output_2 = detect_email_addresses(key, tag_types={\"EMAIL\"}, new_email_regex=True)\n",
    "    if output_1:\n",
    "        old_regex_nb += 1\n",
    "        old_regex_results.append(output_1)\n",
    "    if output_2:\n",
    "        new_regex_nb += 1\n",
    "        new_regex_results.append(output_2)\n",
    "\n",
    "print(f\"nb emails dectected by old regex: {old_regex_nb}\")\n",
    "print(f\"nb emails dectected by new BS regex: {new_regex_nb}\")\n",
    "print(f\"number true EMAILS: {len(ds_emails[:7])}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "ref = [(1, 5), (6, 9)]\n",
    "pred = [(1, 10)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = {\"TP\": 0, \"FN\": 0, \"FP\": 0}\n",
    "alpha = 0.8\n",
    "beta = 0.8\n",
    "\n",
    "def overlapped(a, b, alpha=0.1, beta=0.1):\n",
    "    \"\"\"Returns True if the intervals a and b overlap for more than 80% of their lengths\"\"\"\n",
    "    size_overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))\n",
    "    ref_overlap = size_overlap / (b[1] - b[0])\n",
    "    pred_overlap = size_overlap / (a[1] - a[0])\n",
    "    return (ref_overlap > alpha and pred_overlap > beta)\n",
    "\n",
    "# use index so to recover the original data\n",
    "detection_indice = {\"TP_pred\": set(), \"TP_ref\": set(), \"FN\": set(), \"FP\": set()}\n",
    "for i, interval in enumerate(pred):\n",
    "    for j, target in enumerate(ref):\n",
    "        if overlapped(interval, target, alpha, beta):\n",
    "            # the prediction is a true positive\n",
    "            scores[\"TP\"] += 1\n",
    "            detection_indice['TP_pred'].add(i)\n",
    "            detection_indice['TP_ref'].add(j)\n",
    "            break\n",
    "    else:\n",
    "        # the prediction is a false positive\n",
    "        scores[\"FP\"] += 1\n",
    "        detection_indice['FP'].add(i)\n",
    "# the rest of the targets that aren't detected are false negatives\n",
    "detection_indice[\"FN\"] = set(range(len(ref))) - detection_indice[\"TP_ref\"]\n",
    "scores[\"FN\"] = len(detection_indice[\"FN\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compare_intervals(ref_intervals, pred_intervals):\n",
    "    \"\"\"Compare two lists of intervals and return the number of true positives, false positives and false negatives\n",
    "    authur : @copilot\n",
    "    \"\"\"\n",
    "    ref_intervals = sorted(ref_intervals, key=lambda x: x[0])\n",
    "    pred_intervals = sorted(pred_intervals, key=lambda x: x[0])\n",
    "    ref_idx = 0\n",
    "    pred_idx = 0\n",
    "    ref_len = len(ref_intervals)\n",
    "    pred_len = len(pred_intervals)\n",
    "    ref_matched = [False] * ref_len\n",
    "    pred_matched = [False] * pred_len\n",
    "    while ref_idx < ref_len and pred_idx < pred_len:\n",
    "        ref_interval = ref_intervals[ref_idx]\n",
    "        pred_interval = pred_intervals[pred_idx]\n",
    "        if overlapped(ref_interval, pred_interval):\n",
    "            ref_matched[ref_idx] = True\n",
    "            pred_matched[pred_idx] = True\n",
    "        if ref_interval[1] < pred_interval[1]:\n",
    "            ref_idx += 1\n",
    "        else:\n",
    "            pred_idx += 1\n",
    "    metrics = {\n",
    "        'TP_ref': sum(ref_matched),\n",
    "        'TP_pred': sum(pred_matched),\n",
    "        'FN': ref_len - sum(ref_matched),\n",
    "        'FP': pred_len - sum(pred_matched),\n",
    "    }\n",
    "    return metrics, ref_matched, pred_matched"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'TP_ref': 2, 'TP_pred': 1, 'FN': 0, 'FP': 0}, [True, True], [True])"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compare_intervals(ref, pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'TP_pred': set(), 'TP_ref': set(), 'FN': {0, 1}, 'FP': {0}}"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "detection_indice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'TP': 0, 'FN': 2, 'FP': 1}"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scores"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.4 ('venv')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "fd8fde6f83dada9276d12fdb71d773558994168ed1b3bea457b8db38c02aa2e1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
